The final homework
Author

Lindsay Jones

Published

December 4, 2022

Homework 5

Setup

Code
library(alr4)
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

Question 1

Code
data(house.selling.price.2)
head(house.selling.price.2)
      P    S Be Ba New
1  48.5 1.10  3  1   0
2  55.0 1.01  3  2   0
3  68.0 1.45  3  2   0
4 137.0 2.40  3  3   0
5 309.4 3.30  4  3   1
6  17.5 0.40  1  1   0

A) For backward elimination, which variable would be deleted first? Why?

The BEDS variable should be eliminated first because, in the full model, it has the largest p-value: once the other predictors are in the model, BEDS contributes the least and is therefore the weakest predictor of price.
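This can be checked directly from the full-model summary: the coefficient table reports a p-value for each predictor, and the one with the largest p-value (Be) is the first candidate for removal.

```r
library(smss)
data(house.selling.price.2)

# Fit the full model and inspect the coefficient table; the predictor
# with the largest p-value (Be, i.e. beds) would be dropped first
full <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(full)$coefficients
```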

B) For forward selection, which variable would be added first? Why?

In forward selection, we add the variable with the strongest significance first. The reported p-value for SIZE is 0 while NEW shows 0.000, which suggests NEW's p-value is merely rounded down to three decimals while SIZE's is smaller still. SIZE is therefore the most statistically significant predictor and should be added first.
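Rather than eyeballing rounded p-values, the first step of forward selection can be computed with `add1()`, which tests each candidate predictor against the intercept-only model; the variable with the largest F statistic enters first. This is a sketch, not part of the original analysis:

```r
library(smss)
data(house.selling.price.2)

# Start from the intercept-only model and test each candidate predictor;
# the largest F statistic (smallest p-value) identifies the variable
# forward selection would add first
null <- lm(P ~ 1, data = house.selling.price.2)
add1(null, scope = ~ S + Be + Ba + New, test = "F")
```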

C) Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

I think the correlation between BEDS and PRICE is largely driven by another variable, most likely SIZE: bigger houses tend to have more bedrooms. Once SIZE and the other predictors are in the model, BEDS adds little independent information, so its coefficient is no longer significant.
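One way to check this suspicion is to look at the correlations among the variables, or at the variance inflation factors (the `car` package is already loaded as a dependency of `alr4`):

```r
library(smss)
data(house.selling.price.2)

# A strong S-Be correlation would explain why Be loses significance
# once S is in the model
cor(house.selling.price.2)

# Variance inflation factors for the full model (car ships with alr4)
car::vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2))
```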

D) Using software with these four predictors, find the model that would be selected using each criterion:

a) R²

b) Adjusted R²

Code
m1 <- lm(P ~ S, house.selling.price.2)
m2 <- lm(P ~ Be, house.selling.price.2)
m3 <- lm(P ~ Ba, house.selling.price.2)
m4 <- lm(P ~ New, house.selling.price.2)
m5 <- lm(P ~ ., house.selling.price.2)
m6 <- lm(P ~ . - Be, house.selling.price.2)
stargazer(m1, m2, m3, m4, m5, m6, type = 'text')

================================================================================================================================================================
                                                                                Dependent variable:                                                             
                    --------------------------------------------------------------------------------------------------------------------------------------------
                                                                                         P                                                                      
                              (1)                    (2)                    (3)                    (4)                     (5)                     (6)          
----------------------------------------------------------------------------------------------------------------------------------------------------------------
S                          75.607***                                                                                    64.761***               62.263***       
                            (3.865)                                                                                      (5.630)                 (4.335)        
                                                                                                                                                                
Be                                                42.969***                                                              -2.766                                 
                                                   (6.160)                                                               (3.960)                                
                                                                                                                                                                
Ba                                                                       76.026***                                      19.203***               20.072***       
                                                                          (7.822)                                        (5.650)                 (5.495)        
                                                                                                                                                                
New                                                                                             34.158***               18.984***               18.371***       
                                                                                                 (9.383)                 (3.873)                 (3.761)        
                                                                                                                                                                
Constant                  -25.194***               -37.229*              -49.248***             89.249***              -41.795***              -47.992***       
                            (6.688)                (19.955)               (15.644)               (5.148)                (12.104)                 (8.209)        
                                                                                                                                                                
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Observations                  93                      93                     93                     93                     93                      93           
R2                           0.808                  0.348                  0.509                  0.127                   0.869                   0.868         
Adjusted R2                  0.806                  0.341                  0.504                  0.118                   0.863                   0.864         
Residual Std. Error    19.473 (df = 91)        35.861 (df = 91)       31.119 (df = 91)       41.506 (df = 91)       16.360 (df = 88)        16.313 (df = 89)    
F Statistic         382.628*** (df = 1; 91) 48.660*** (df = 1; 91) 94.473*** (df = 1; 91) 13.254*** (df = 1; 91) 145.763*** (df = 4; 88) 195.313*** (df = 3; 89)
================================================================================================================================================================
Note:                                                                                                                                *p<0.1; **p<0.05; ***p<0.01

For parts a and b above, I compare several models in one table (I did not run all possible models, just the ones suggested by the earlier parts). Because R² can never decrease when a predictor is added, the model including all variables (m5) is selected under R². Adjusted R² penalizes extra predictors, so it instead selects the model with all variables except BEDS (m6).
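For completeness, an exhaustive search over all 2⁴ subsets could be run with the `leaps` package. This was not part of the analysis above; it is a sketch assuming `leaps` is installed:

```r
library(leaps)
library(smss)
data(house.selling.price.2)

# Evaluate every subset of the four predictors; summary() reports which
# variables appear in the best model of each size, and adjr2 gives the
# adjusted R^2 for each of those best models
subsets <- regsubsets(P ~ ., data = house.selling.price.2, nvmax = 4)
summary(subsets)$adjr2
```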

c) PRESS

Code
#use PRESS function to test all models from previous questions

# PRESS = sum of squared leave-one-out prediction errors, computed from
# the hat values so the model never has to be refit
PRESS <- function(linear.model) {
  pr <- residuals(linear.model) / (1 - lm.influence(linear.model)$hat)
  sum(pr^2)
}

cat('       ', 'PRESS', '\n',
'm1:  ', PRESS(m1), '\n',
'm2:  ', PRESS(m2), '\n',
'm3:  ', PRESS(m3), '\n',
'm4:  ', PRESS(m4), '\n',
'm5:  ', PRESS(m5), '\n',
'm6:  ', PRESS(m6), '\n')
        PRESS 
 m1:   38203.29 
 m2:   122984.3 
 m3:   95732.14 
 m4:   164039.3 
 m5:   28390.22 
 m6:   27860.05 

If using PRESS as the primary criterion, model m6 is the best choice because it has the lowest PRESS score.
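As a sanity check, PRESS is equivalent to the sum of squared leave-one-out prediction errors, so the hat-value shortcut can be verified by refitting the model n times. This brute-force version is slow but transparent (a sketch, shown for m6):

```r
library(smss)
data(house.selling.price.2)
d <- house.selling.price.2

# Refit the m6 specification leaving out one row at a time and collect
# the prediction error for the held-out row; the sum of squares should
# match the hat-value shortcut up to floating-point error
loo_err <- sapply(seq_len(nrow(d)), function(i) {
  fit <- lm(P ~ . - Be, data = d[-i, ])
  d$P[i] - predict(fit, newdata = d[i, ])
})
sum(loo_err^2)
```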

d) AIC

Code
cat('       ', 'AIC', '\n',
'm1:  ', AIC(m1), '\n',
'm2:  ', AIC(m2), '\n',
'm3:  ', AIC(m3), '\n',
'm4:  ', AIC(m4), '\n',
'm5:  ', AIC(m5), '\n',
'm6:  ', AIC(m6), '\n')
        AIC 
 m1:   820.1439 
 m2:   933.7168 
 m3:   907.3327 
 m4:   960.908 
 m5:   790.6225 
 m6:   789.1366 

m6 is the best choice here, but m5 is close.
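The same conclusion follows from automated backward elimination with `step()`, which minimizes AIC at each stage. (Note that `step()` computes AIC only up to an additive constant, so its printed values differ from `AIC()`, but the ranking of models is the same.)

```r
library(smss)
data(house.selling.price.2)

# Backward elimination by AIC starting from the full model; Be is the
# only variable dropped, leaving the m6 specification
full <- lm(P ~ ., data = house.selling.price.2)
step(full, direction = "backward")
```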

e) BIC

Code
cat('       ', 'BIC', '\n',
'm1:  ', BIC(m1), '\n',
'm2:  ', BIC(m2), '\n',
'm3:  ', BIC(m3), '\n',
'm4:  ', BIC(m4), '\n',
'm5:  ', BIC(m5), '\n',
'm6:  ', BIC(m6), '\n')
        BIC 
 m1:   827.7417 
 m2:   941.3146 
 m3:   914.9305 
 m4:   968.5058 
 m5:   805.8181 
 m6:   801.7996 

Similar to AIC, m6 is again the best choice here, but m5 is close.
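Setting `k = log(n)` makes `step()` apply the BIC penalty instead of the AIC penalty; it again settles on the model without BEDS (a sketch):

```r
library(smss)
data(house.selling.price.2)

# Backward elimination with the BIC penalty (k = log(n) instead of the
# default k = 2); expected to drop only Be
n <- nrow(house.selling.price.2)
full <- lm(P ~ ., data = house.selling.price.2)
step(full, direction = "backward", k = log(n))
```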

f) Explain which model you prefer and why.

It seems to me that m6, which includes all variables except BEDS, is the best fit: it wins on adjusted R², PRESS, AIC, and BIC. Model m5, which includes all variables, would be a close second choice.

Question 2

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

A) fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

Code
tree <- lm(Volume ~ Girth + Height, data=trees)
tree

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Coefficients:
(Intercept)        Girth       Height  
   -57.9877       4.7082       0.3393  
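The fitted coefficients can be used directly for prediction. For example, for a hypothetical tree with a 12-inch girth and a 75-foot height (values chosen purely for illustration):

```r
tree <- lm(Volume ~ Girth + Height, data = trees)

# By hand: -57.99 + 4.71 * 12 + 0.34 * 75 is roughly 24 cubic feet
predict(tree, newdata = data.frame(Girth = 12, Height = 75))
```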

B) Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions are violated?

Code
par(mfrow = c(2,3)); plot(tree, which = 1:6)

The curved line in the Residuals vs Fitted plot indicates a violation of the linearity assumption. The line in the Scale-Location plot is also not horizontal, which points to non-constant error variance, i.e., a violation of the homoskedasticity assumption.
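Since the volume of a cylinder scales like girth² × height, one common remedy for the curvature (not asked for here, just a sketch) is a log-log specification:

```r
# Log-log model: if trees were perfect cylinders, the log(Girth)
# coefficient would be near 2 and the log(Height) coefficient near 1
tree_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
coef(tree_log)
```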

Question 3

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.

A) Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
vote <- lm(Buchanan ~ Bush, data=florida)
par(mfrow = c(2,3)); plot(vote, which = 1:6)

Palm Beach does appear to be an outlier: it falls far outside the reference lines on most of the plots. Dade also stands out as an outlier on a few of the plots.
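The visual impression can be confirmed numerically with Cook's distance; the county with the largest value should be the one flagged on the plots (this assumes, as the plot labels suggest, that the rows of `florida` are named by county):

```r
library(alr4)

# County (row) with the largest Cook's distance; the diagnostic plots
# suggest this will be Palm Beach
vote <- lm(Buchanan ~ Bush, data = florida)
cd <- cooks.distance(vote)
rownames(florida)[which.max(cd)]
```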

B) Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (a). Do your findings change?

Code
vote_log <- lm(log(Buchanan)~log(Bush),data=florida)
par(mfrow = c(2,3)); plot(vote_log, which = 1:6)

Palm Beach still appears to be a significant outlier, but now there are other counties which also stand out. Dade no longer appears to be an outlier on any plot.
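On the log scale we can also quantify how unusual Palm Beach is by comparing its actual Buchanan vote with the model's prediction. This is a sketch; the row is selected by its largest residual rather than by name, on the assumption (consistent with the plots) that this is Palm Beach:

```r
library(alr4)

# Fit the log-log model and find the county with the largest residual,
# then compare its actual Buchanan vote to the back-transformed fit
vote_log <- lm(log(Buchanan) ~ log(Bush), data = florida)
i <- which.max(abs(residuals(vote_log)))
c(actual    = florida$Buchanan[i],
  predicted = exp(fitted(vote_log)[i]))
```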